library(tidyverse)
library(ISLR)
library(MASS)

Here, I’ll be using the Auto dataset.

data("Auto")
attach(Auto)
fit = lm(mpg ~ horsepower)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ horsepower)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Is there a relationship between the predictor and the response?

Yes: the F-statistic of 599.7 with a p-value below 2.2e-16 indicates a relationship between horsepower and mpg, and it is a negative one. For every one-unit increase in horsepower, miles per gallon decreases by about 0.158 units on average.

  1. How strong is the relationship between the predictor and the response? Based on the R-squared, which is just over 60%, the relationship is moderately strong: horsepower alone explains about 61% of the variance in mpg.

  2. What is the predicted mpg associated with a horsepower of 98? What are the associated 95 % confidence and prediction intervals?

predict(fit, data.frame(horsepower = 98), interval = "confidence")
##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108
predict(fit, data.frame(horsepower = 98), interval = "prediction")
##        fit     lwr      upr
## 1 24.46708 14.8094 34.12476

Diagnostics

par(mfrow = c(1,1))
plot(horsepower, mpg)
abline(fit)

par(mfrow = c(2,2))
plot(fit)

The plot of residuals versus fitted values indicates non-linearity in the data. The plot of standardized residuals versus leverage indicates a few outliers (standardized residuals above 2 or below -2) and a few high-leverage points.
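Given the apparent non-linearity, a natural follow-up (a sketch, assuming the ISLR Auto data is available) is to add a quadratic term in horsepower and compare the two fits with a nested-model F-test:

```r
library(ISLR)

fit_lin  <- lm(mpg ~ horsepower, data = Auto)
fit_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)

# Nested-model F-test: does the quadratic term help?
anova(fit_lin, fit_quad)
```

A small p-value here would confirm the curvature suggested by the residual plot.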

Question 9

Multiple Linear Regression

pairs(Auto)

Let’s look at the correlation matrix of the numeric variables.

cor(Auto[1:8])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
fit = lm(mpg ~ . -name, data = Auto)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
  1. Is there a relationship between the predictors and the response? There appears to be a relationship between the predictors and the response, since the F-statistic is 252.4 with a p-value well below 0.05: at least one predictor has a non-zero coefficient.

  2. Which predictors appear to have a statistically significant relationship to the response? Displacement, weight, year and origin have a statistically significant relationship to the response.

  3. What does the coefficient for the year variable suggest?

The coefficient for the year variable suggests that cars have become more fuel-efficient over time: holding the other predictors fixed, mpg increases by about 0.75 per model year.

Diagnostics

par(mfrow = c(2,2))
plot(fit)


Interaction

a = Auto[1:8]

fit_interaction = lm(mpg ~ . * ., data = a)


summary(fit_interaction)
## 
## Call:
## lm(formula = mpg ~ . * ., data = a)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16
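Whether the interaction terms improve the fit as a group can also be tested formally; a sketch, assuming the ISLR Auto data is available:

```r
library(ISLR)

a <- Auto[1:8]                        # drop the name column
fit_add <- lm(mpg ~ ., data = a)      # additive model
fit_int <- lm(mpg ~ . * ., data = a)  # all pairwise interactions

# Nested-model F-test for the interaction terms as a group
anova(fit_add, fit_int)
```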

Question 10

data("Carseats")

Fit a multiple regression model to predict Sales using Price, Urban, and US.

attach(Carseats)
fit = lm(Sales ~ Price + Urban + US)

summary(fit)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Sales are recorded in thousands of units, so: for every 1 dollar increase in the price of a car seat, sales decrease by about 54.5 units on average, adjusting for Urban and US. On average, sales in urban areas are about 21.9 units lower, adjusting for Price and US (though this effect is not statistically significant). On average, sales in the US are about 1,201 units higher, adjusting for Price and Urban.

Fitting a model with only the significant variables.

fit1 = lm(Sales ~ Price + US)
summary(fit1)
## 
## Call:
## lm(formula = Sales ~ Price + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

The Multiple R-squared is essentially unchanged (0.2393 for both models), while the adjusted R-squared is slightly higher for the smaller model (0.2354 vs. 0.2335), so dropping Urban costs nothing in fit.
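Since the smaller model is nested in the larger one, dropping Urban can also be tested directly; a sketch, assuming the ISLR Carseats data is available:

```r
library(ISLR)

fit_full  <- lm(Sales ~ Price + Urban + US, data = Carseats)
fit_small <- lm(Sales ~ Price + US, data = Carseats)

# F-test for dropping Urban; a large p-value means Urban adds nothing
anova(fit_small, fit_full)
```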

Confidence intervals

confint(fit)
##                   2.5 %      97.5 %
## (Intercept) 11.76359670 14.32334118
## Price       -0.06476419 -0.04415351
## UrbanYes    -0.55597316  0.51214085
## USYes        0.69130419  1.70984121

Is there evidence of outliers or high-leverage observations?

plot(fit)

Question 11

In this problem we will investigate the t-statistic for the null hypothesis H0 : β = 0 in simple linear regression without an intercept.

set.seed(1)
x = rnorm(100)
y = 2 * x + rnorm(100)

Regression of y on x without the intercept

fit = lm(y ~ x + 0)
summary(fit)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9154 -0.6472 -0.1771  0.5056  2.3109 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   1.9939     0.1065   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16

beta coefficient = 1.9939, standard error = 0.1065, and t-value = 18.73.

Regression of x on y without the intercept

fit1 = lm(x ~ y + 0)
summary(fit1)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8699 -0.2368  0.1030  0.2858  0.8938 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y  0.39111    0.02089   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16

beta coefficient = 0.39111, standard error = 0.02089, and t-value = 18.73.

We obtain the same value for the t-statistic and consequently the same p-value. Both regressions describe the same underlying line: y = 2x + ε can equivalently be written x = 0.5(y − ε).
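The equality of the two t-statistics can be checked against the closed-form expression for the no-intercept t-statistic, which is symmetric in x and y (a self-contained sketch):

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
n <- length(x)

# Closed-form t-statistic for regression through the origin;
# swapping x and y leaves it unchanged
t_formula <- sqrt(n - 1) * sum(x * y) /
  sqrt(sum(x^2) * sum(y^2) - sum(x * y)^2)

t_yx <- summary(lm(y ~ x + 0))$coefficients[1, "t value"]
t_xy <- summary(lm(x ~ y + 0))$coefficients[1, "t value"]
c(t_formula, t_yx, t_xy)
```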

Regression with intercept

summary(lm(y ~ x))
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8768 -0.6138 -0.1395  0.5394  2.3462 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.03769    0.09699  -0.389    0.698    
## x            1.99894    0.10773  18.556   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9628 on 98 degrees of freedom
## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7762 
## F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16
summary(lm(x ~ y))
## 
## Call:
## lm(formula = x ~ y)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90848 -0.28101  0.06274  0.24570  0.85736 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03880    0.04266    0.91    0.365    
## y            0.38942    0.02099   18.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4249 on 98 degrees of freedom
## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7762 
## F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16

In this case as well, the t-statistics for the two regressions are equal (18.56).

Question 12

Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

When the sum of the squares of the observed y-values is equal to the sum of the squares of the observed x-values.
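This follows from the closed-form slope estimates for regression through the origin, which share a numerator and differ only in the denominator (a self-contained sketch):

```r
set.seed(1)
x <- rnorm(100)
y <- 3 * x + rnorm(100)

# Slopes for the two no-intercept regressions
beta_yx <- sum(x * y) / sum(x^2)  # y on x
beta_xy <- sum(x * y) / sum(y^2)  # x on y

# They coincide exactly when sum(x^2) == sum(y^2)
c(beta_yx, beta_xy)
```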

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X

set.seed(1)

x = rnorm(100)
y = 2*x

summary(lm(y ~ x + 0))
## Warning in summary.lm(lm(y ~ x + 0)): essentially perfect fit: summary may
## be unreliable
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.776e-16 -3.378e-17  2.680e-18  6.113e-17  5.105e-16 
## 
## Coefficients:
##    Estimate Std. Error   t value Pr(>|t|)    
## x 2.000e+00  1.296e-17 1.543e+17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.167e-16 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.382e+34 on 1 and 99 DF,  p-value: < 2.2e-16
summary(lm(x ~ y + 0))
## Warning in summary.lm(lm(x ~ y + 0)): essentially perfect fit: summary may
## be unreliable
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.888e-16 -1.689e-17  1.339e-18  3.057e-17  2.552e-16 
## 
## Coefficients:
##   Estimate Std. Error   t value Pr(>|t|)    
## y 5.00e-01   3.24e-18 1.543e+17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.833e-17 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.382e+34 on 1 and 99 DF,  p-value: < 2.2e-16

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

set.seed(1)

x = rnorm(100)
y = sample(x, 100)

g = data.frame(x = x, y = y)

ggplot(g, aes(x = x, y = y)) + geom_point()

sum(x^2)
## [1] 81.05509
sum(y^2)
## [1] 81.05509
summary(lm(y ~ x))
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.32827 -0.60584  0.00216  0.58434  2.29058 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.108130   0.090942   1.189    0.237
## x           0.006955   0.101013   0.069    0.945
## 
## Residual standard error: 0.9027 on 98 degrees of freedom
## Multiple R-squared:  4.837e-05,  Adjusted R-squared:  -0.01016 
## F-statistic: 0.00474 on 1 and 98 DF,  p-value: 0.9452
summary(lm(x ~ y))
## 
## Call:
## lm(formula = x ~ y)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.33102 -0.60922  0.00922  0.57929  2.29163 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.108130   0.090942   1.189    0.237
## y           0.006955   0.101013   0.069    0.945
## 
## Residual standard error: 0.9027 on 98 degrees of freedom
## Multiple R-squared:  4.837e-05,  Adjusted R-squared:  -0.01016 
## F-statistic: 0.00474 on 1 and 98 DF,  p-value: 0.9452

Question 13

x = rnorm(100)
esp = rnorm(100, 0,sqrt(0.25))
y = -1 + 0.5 * x + esp

y is of length 100. β0 is -1, β1 is 0.5

par(mfrow = c(1,1))
plot(x, y)

fit = lm(y ~ x)
summary(fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2425 -0.2512  0.0136  0.3502  1.2736 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.94976    0.04901  -19.38   <2e-16 ***
## x            0.49066    0.04695   10.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4899 on 98 degrees of freedom
## Multiple R-squared:  0.527,  Adjusted R-squared:  0.5222 
## F-statistic: 109.2 on 1 and 98 DF,  p-value: < 2.2e-16

beta0 hat = -0.95 and beta1 hat = 0.49. The fitted coefficients are close to the true values used to generate the data, and the model has a large F-statistic with a near-zero p-value, so the null hypothesis can be rejected.

Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend() command to create an appropriate legend.

plot(x, y)
abline(fit, lwd = 3, col = 2)
abline(-1, 0.5, lwd = 3, col = 3)
legend("topleft", legend = c("model fit", "pop. regression"), col = 2:3, lwd = 3)

Now fit a polynomial regression model that predicts y using x and x². Is there evidence that the quadratic term improves the model fit?

fit1 = lm(y ~ x + I(x^2))
summary(fit1)
## 
## Call:
## lm(formula = y ~ x + I(x^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2428 -0.2535  0.0137  0.3485  1.2712 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.947344   0.060802 -15.581   <2e-16 ***
## x            0.490788   0.047230  10.392   <2e-16 ***
## I(x^2)      -0.002217   0.032762  -0.068    0.946    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4925 on 97 degrees of freedom
## Multiple R-squared:  0.5271, Adjusted R-squared:  0.5173 
## F-statistic: 54.05 on 2 and 97 DF,  p-value: < 2.2e-16

There is little evidence that the quadratic term improves the fit: R² is essentially unchanged, the residual standard error actually increases slightly (0.4925 vs. 0.4899), and the p-value for the x² coefficient (0.946) suggests no relationship between y and x².
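The same comparison can be made with a nested-model F-test; a self-contained sketch (the chunk above set no seed, so these simulated numbers won't match the output shown):

```r
set.seed(1)
x <- rnorm(100)
y <- -1 + 0.5 * x + rnorm(100, 0, sqrt(0.25))

fit_lin  <- lm(y ~ x)
fit_quad <- lm(y ~ x + I(x^2))

# F-test for the quadratic term; since the true model is linear,
# a large p-value is expected
anova(fit_lin, fit_quad)
```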

Repeat after modifying the data generation process in such a way that there is less noise in the data. The model (3.39) should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term ε

x1 = rnorm(100)
esp1 = rnorm(100, 0, 0.1)
y1 = -1 + 0.5 * x1 + esp1
par(mfrow = c(1,1))
plot(x1, y1)

fit1 = lm(y1 ~ x1)
summary(fit1)
## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.25031 -0.07919  0.00240  0.05670  0.38216 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.00058    0.01101  -90.92   <2e-16 ***
## x1           0.50505    0.01052   47.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1098 on 98 degrees of freedom
## Multiple R-squared:  0.9592, Adjusted R-squared:  0.9588 
## F-statistic:  2303 on 1 and 98 DF,  p-value: < 2.2e-16

Repeat after modifying the data generation process in such a way that there is more noise in the data. The model (3.39) should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term ε

x2 = rnorm(100)
esp2 = rnorm(100, 0, 1)
y2 = -1 + 0.5 * x2 + esp2
par(mfrow = c(1,1))
plot(x2, y2)

fit2 = lm(y2 ~ x2)
summary(fit2)
## 
## Call:
## lm(formula = y2 ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68380 -0.85153 -0.09211  0.91308  2.17018 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.0879     0.1107  -9.823 2.93e-16 ***
## x2            0.7261     0.1063   6.829 7.23e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.103 on 98 degrees of freedom
## Multiple R-squared:  0.3224, Adjusted R-squared:  0.3155 
## F-statistic: 46.63 on 1 and 98 DF,  p-value: 7.23e-10

What are the confidence intervals for β0 and β1 based on the original data set, the noisier data set, and the less noisy data set?

confint(fit)
##                  2.5 %     97.5 %
## (Intercept) -1.0470067 -0.8525044
## x            0.3974859  0.5838434
confint(fit1)
##                  2.5 %     97.5 %
## (Intercept) -1.0224169 -0.9787372
## x1           0.4841616  0.5259337
confint(fit2)
##                  2.5 %     97.5 %
## (Intercept) -1.3076392 -0.8680943
## x2           0.5150973  0.9371055

The slope intervals are all centered near 0.5, with the low-noise fit’s interval (fit1) narrower than the original fit’s and the high-noise fit’s interval (fit2) wider.
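The widths can be compared directly with diff() on the rows of confint(); a self-contained sketch that refits the same true model at three noise levels:

```r
set.seed(1)
x <- rnorm(100)

# Fit y = -1 + 0.5 x + eps for a given noise standard deviation
fit_at_sd <- function(sd) {
  y <- -1 + 0.5 * x + rnorm(100, 0, sd)
  lm(y ~ x)
}
fits <- lapply(c(low = 0.1, mid = 0.5, high = 1), fit_at_sd)

# Width of the 95% interval for the slope at each noise level
widths <- sapply(fits, function(f) unname(diff(confint(f)["x", ])))
widths
```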

Question 14 - Problem of collinearity

set.seed(1)
x1 = runif(100)
x2 = 0.5 * x1 + rnorm(100)/10
y = 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

The last line creates a linear model in which y is a function of x1 and x2, with true coefficients β0 = 2, β1 = 2 and β2 = 0.3.

What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.

cor(x1, x2)
## [1] 0.8351212

correlation = 0.8351212

plot(x1, x2)

Using this data, fit a least squares regression to predict y using x1 and x2

fit = lm(y ~ x1 + x2)
summary(fit)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05

The coefficient estimates β̂0, β̂1 and β̂2 are 2.1305, 1.4396 and 1.0097, respectively; only β̂0 is close to its true value (β0 = 2). Since the p-value for x1 is below 0.05 we may reject H0: β1 = 0, but we may not reject H0: β2 = 0, as its p-value is above 0.05.
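The collinearity between x1 and x2 can be quantified with a variance inflation factor, computed here by hand as 1/(1 − R²) from regressing one predictor on the other (a self-contained sketch reusing the same data-generating code):

```r
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10

# VIF for x2 given x1 (with only two predictors, x1 has the same VIF)
r2  <- summary(lm(x2 ~ x1))$r.squared
vif <- 1 / (1 - r2)
vif
```

A VIF above roughly 5 to 10 is the usual rule of thumb for problematic collinearity; here the value is elevated but moderate, consistent with the inflated standard errors seen in the fit.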